T23 SpilloverDiD tutorial: TVA-style ~40% understatement walkthrough#493
Conversation
|
Overall Assessment Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment✅ Looks good Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Executive Summary
Methodology No findings. The tutorial’s ring-regressor form, Code Quality No findings. Performance No findings. Maintainability
Tech Debt No findings. Dropping the delivered tutorial row from Security No findings. Documentation/Tests
|
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Executive Summary
Methodology Affected methods reviewed:
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Executive Summary
Methodology
Code Quality No findings. Performance No findings. Maintainability No findings. The notebook/test DGP duplication is adequately defended by the constant-sync and AST-sync guards in Tech Debt No findings. Removing the delivered tutorial row from Security No findings. Documentation/Tests No additional findings. The docs wiring and references are consistent across Residual risk: this was a static review only; I could not run |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Executive Summary
Methodology No findings. The tutorial’s API note for Code Quality No findings. Performance No findings. Maintainability No findings. The notebook/test DGP duplication is adequately protected by constant-sync and AST-sync guards at Tech Debt No findings. Removing the delivered tutorial row is appropriate, and the remaining SpilloverDiD follow-ups remain tracked at Security No findings. Documentation/Tests No findings. The documentation surfaces are internally consistent, and the new test file covers the main narrative claims well. Residual risk: I could not run the drift suite locally because |
New tutorial docs/tutorials/23_spillover_tva.ipynb (slot 23 since 22_had_survey_design.ipynb occupies slot 22) walks practitioners through SpilloverDiD on a TVA-style synthetic panel reproducing the Butts (2021) §4 Table 1 Panel A ~40% understatement direction. DGP locked at MAIN_SEED=23: 4-period, 200-unit panel laid out as 25 treated + 120 near-control + 55 far-control. With tau_total=-7.4 and delta_1=-4.5, naive multi-period TWFE understates by 42% (att=-4.29); SpilloverDiD with rings=[0.0, 100.0] cleanly recovers tau_total=-7.34 and delta_1=-4.53. Section 6 demonstrates Conley spatial-HAC variance with conley_cutoff_km=100 per Butts §3.1. Drift tests at tests/test_t23_spillover_tva_drift.py: 14 function- level tests mirroring the T19 pattern. Module-scoped panel / naive_fit / spillover_fit / spillover_conley_lag0_fit / spillover_conley_lag1_fit fixtures. Per feedback_t19_drift_guards_test_file_only, the notebook contains zero assert statements — all numerical pins live in the test file. The DGP-builder function is duplicated verbatim between the notebook §2 cell and the test fixture; test_dgp_true_parameters_match_quoted pins the constants both copies share. Cross-references: - docs/index.rst: toctree entry under Tutorials: Business Applications. - docs/methodology/papers/butts-2021-review.md:257: T22→T23 (slot conflict callout). - docs/references.rst: adds Kline & Moretti (2014) — cited in §1. - TODO.md: drops the resolved T22-tutorial row. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P1 fix: the §3 "naive multi-period TWFE" baseline was fit without unit FE absorption — only the implicit time-interaction was applied. ATT was correct (balanced panel makes pooled and TWFE coincide on point estimates) but the SE was the pooled SE (~0.28) not the TWFE SE (~0.14). Adds `absorb=["unit"]` to both the notebook §3 fit and the drift-test `naive_fit` fixture, making the regression a true two-way fixed-effects panel TWFE per the MultiPeriodDiD contract documented at `docs/methodology/REGISTRY.md:L119-L141`. All drift tests pass unchanged (ATT pin is invariant under unit-FE absorption on this balanced DGP). P2 fix: §6 attributed the lower Conley SE to "negative cross-covariance between treated and far-control PSUs at long distances" — but the far controls sit at 200-330 km from origin while `conley_cutoff_km=100`, so those pairs are OUTSIDE kernel support and cannot drive the variance. Rewrites the explanation to refer to the within-cutoff covariance structure (treated cluster + near-control band, both pairwise within 100 km) and notes that the sign of the net effect on SE depends on the per-pair score covariances under this specific DGP. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex R2 caught a terminology mismatch in §6: the explanation said "only PSU pairs within 100 km contribute" but the fit shown is `SpilloverDiD(vcov_type="conley")` with no `survey_design`, so the no-survey Conley path runs an observation-level pairwise sum over within-period score pairs — PSU aggregation only kicks in on the survey-design path per `docs/methodology/REGISTRY.md:L3353-L3357` and `:L3549-L3555`. Rewrites the §6 mechanism paragraph to use "score pairs" / "within-period observation pairs" language and explicitly notes the no-survey scope. Also clarifies that the spatial term's outside- support statement applies to the spatial sum specifically; the serial term (`lag=1`) is a separate within-unit cross-period covariance sum. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P2 fixes (warning suppression / direction-pinning gap):
- Naive MultiPeriodDiD fit now passes `reference_period=2` explicitly,
silencing the FutureWarning about the reference-period default
change at the source. Bit-identical to the prior fit (the new
default already matches the explicit value).
- Replaces blanket `simplefilter("ignore")` on the MPD fit with a
message-scoped filter limited to the expected
"Rank-deficient design matrix" UserWarning. That warning is benign
here: `absorb=["unit"]` makes the unit-invariant `ever_treated`
indicator perfectly collinear with the unit FE, so MPD drops it and
identifies the ATT through the ever_treated × post interactions —
the standard TWFE specification. Now any unexpected warning would
surface.
- Adds `test_conley_se_less_than_hc1` to pin the §6 prose claim
"on this DGP, the Conley spatial-HAC SE comes in *lower* than HC1"
directionally. The pre-existing `test_conley_se_differs_from_hc1`
is retained as a direction-agnostic sanity check.
P3 fix (§4 d_bar=50 mechanism):
The previous wording said `delta_1` attenuates because the ring
"averages over fewer units" — incorrect under this constant-`delta_1`
DGP. The real mechanism is misspecification: near-controls at 50-78 km
ARE exposed (within the true 100 km spillover horizon) but the
`d_bar=50` ring misclassifies them as far-away clean controls, so
both `tau` deflates (contaminated control arm) and `delta_1`
attenuates (cleaner near-controls now mixed into the `S=0` comparison).
Rewrites the §4 paragraph accordingly and references the registry's
documented failure mode for undershooting `d_bar`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ings
P1 fix (§6 pairwise Conley support):
The previous §6 rewrite said far-control band is "outside kernel
support" and "contributes nothing beyond own diagonal" — wrong.
Conley kernel support is determined PAIRWISE between observations,
not relative to the treated cluster. The far-control band lat extent
is ~111 km, so most far/far pairs ARE within 100 km of each other
and DO contribute non-zero Bartlett weight to the spatial Conley meat.
Rewrites §6 to enumerate which pair classes contribute under this
DGP's geometry (treated×treated, treated×near, near×near, AND far×far
within-band) vs which are outside the 100 km support (treated×far,
near×far cross-band). The Conley < HC1 result on this DGP comes from
the net sign of ALL the within-100-km cross-covariance terms, not
just the treated-near block.
P2 fix (narrow warning suppression + warning-policy guard):
Replaces blanket `warnings.simplefilter("ignore")` around every
SpilloverDiD fit with a message-scoped filter for the three known
benign `*encountered in matmul` RuntimeWarnings the stage-2 sandwich
emits from a per-unit FE projection edge case. Notebook §4/§5/§6 fits
and the corresponding drift fixtures all route through
`_silence_spillover_matmul_warnings()` for a single source of truth.
Adds `test_spillover_fit_emits_only_known_matmul_warnings` (T19-pattern
warning-policy guard) that runs the headline fit under
`warnings.simplefilter("always")` and asserts ONLY the three known
matmul warnings are emitted — any new warning category, or new
RuntimeWarning message, surfaces immediately instead of being silently
masked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Completes the warning-suppression narrowing from R4. R5 caught three
spots still using blanket `warnings.simplefilter("ignore")`:
- Notebook §6 `fit_with()` helper (the Conley-vs-HC1 comparison cell).
- Drift test `test_rings_sensitivity_grid_endpoints` loop body.
- Drift test `test_rings_grid_d_bar_100_to_200_identical` loop body.
All three now use either the notebook-side narrow `RuntimeWarning`
filter for `.*encountered in matmul` or the drift-test helper
`_silence_spillover_matmul_warnings()`, matching the §5 fit fixture
pattern. Single source of truth for the suppression contract.
Adds `test_spillover_conley_fit_emits_only_known_matmul_warnings`
parallel to the §5 warning-policy guard, locking the warning surface
of the §6 Conley-path fits (lag=0 spatial-only and lag=1
spatial+serial). Refactors the assertion body into a shared helper
`_assert_only_known_matmul_warnings()` used by both §5 and §6 guards.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R6 caught a methodologically-misleading phrasing in §6: the previous paragraph attributed the Conley<HC1 result on this DGP to "positive autocorrelation of scores within the far-control block", but in the Conley sandwich formula positive off-diagonal score covariances ADD to the variance (they make the meat larger), not shrink it. Replaces that sentence with a sign-agnostic explanation: the Conley meat is HC1 diagonal terms PLUS Bartlett-weighted off-diagonal score cross-products from within-support pair classes, and the sandwich ATT variance depends on the NET sign of those terms — on this seed they net out below HC1, but the direction is a feature of the data/design and not attributable to any single pair class. No behavior or numerical change; prose-only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R7 caught two asymmetries in the warning-policy guard:
1. The assertion only rejected "unexpected" warnings but didn't
pin presence of the three known matmul flavors. If upstream
fixed one (so e.g. "overflow encountered in matmul" stopped
firing), the test would still pass and the notebook prose /
comments referencing "three known warnings" would silently go
stale.
2. The silence filter used `.*encountered in matmul` as a regex
pattern, which would also silence a hypothetical 4th flavor
("underflow encountered in matmul") without surfacing it.
Renames the helper to `_assert_exact_matmul_warning_surface` and
extracts the three exact expected messages into a module-level
constant `_EXPECTED_MATMUL_MESSAGES`. Both halves of the contract
are now asserted:
- (a) every captured warning matches one of the three EXACT
expected messages (rejects new flavors / categories);
- (b) ALL THREE messages are present at least once (rejects
silent disappearance of any of them).
`_silence_spillover_matmul_warnings()` is also tightened to install
one `filterwarnings("ignore", category=RuntimeWarning, message=m)`
per exact expected message string, so a new matmul flavor is no
longer auto-silenced at fixture level — it surfaces immediately in
the guard test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R8 noted that the R7 fix asserted presence/absence of the three known matmul warnings but didn't pin their multiplicity. If upstream starts firing one of the same messages twice (e.g. a stage-2 loop refactor emits "divide by zero" 2x instead of 1x), the assertion still passes silently, while the notebook's repeated "three benign matmul warnings" claim becomes stale. Tightens the guard helper to assert EXACT multiset equality via `Counter(observed) == Counter(_EXPECTED_MATMUL_MESSAGES)` — each expected message must fire exactly once. Adds a separate `non_runtime` precheck that surfaces any unexpected warning category (UserWarning, FutureWarning, etc.) with a category-named error. Verified the current multiplicity is 1x each across all three fits (HC1, Conley lag=0, Conley lag=1) before tightening; if a future library change shifts that count legitimately, the failure message lists observed vs. expected counts so a maintainer can update both `_EXPECTED_MATMUL_MESSAGES` / the multiplicity expectation AND the §5/§6 notebook narrative. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R9 caught two P3 cosmetic accuracy issues:
1. The §2 panel-layout table and §6 pair-enumeration both quoted
geometry numbers that don't match the seeded DGP. The "within
~5 km of (0,0)" treated-cluster phrasing is the expectation of a
single `rng.normal(0, 0.05)` draw, but at seed 23 the treated
cluster max distance from origin is 11.6 km and the pairwise
diameter is 22.3 km. Similarly the near-band extent (11-78 km)
and far-band extent (~111 km lat) were stylized round numbers
rather than the seed-23 actuals.
Updates the §2 table and the §6 pair enumeration to the seed-23
measured values:
- treated cluster: max ~12 km, diameter ~22 km
- near band: ~12-82 km
- far band: ~224-331 km (lat extent ~131 km)
- far x far within 100 km: ~95% (median pair dist ~56 km)
- near x near within 100 km: 100%
The notebook's substantive claims (recovery / sensitivity grid /
Conley < HC1) are unchanged; only the geometric descriptions.
2. CHANGELOG entry says the drift file has "14 function-level
tests" — the file now has 17 (after the R3/R4/R5 additions of the
`test_conley_se_less_than_hc1` direction pin and the §5/§6
warning-policy guards). Updates the count and lists the new
coverage surface (direction pin + exact warning multiset).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI codex flagged a merge-blocker P1: my warning-policy guards required the three `*encountered in matmul` RuntimeWarnings to fire on every SpilloverDiD fit (Counter-multiset equality of an exact expected multiset). But per TODO.md "RuntimeWarnings in Linear Algebra Operations", those warnings are an Apple Silicon M4 + macOS Sequoia + numpy<2.3 Accelerate BLAS artifact (numpy#28687, fixed in numpy>=2.3) and DO NOT fire on M3 / Intel / Linux. The pure-Python CI job runs Ubuntu with latest numpy, so my multiplicity assertion would fail even on a fully-correct fit. Switches to the T19 platform-agnostic pattern (mirrored from tests/test_t19_marketing_pulse_drift.py:221-260): - Drift test installs the narrow `.*encountered in matmul` filter INSIDE the warning-capture block, then asserts the post-filter captured list is empty. Renamed both guards to `test_*_warning_policy_post_filter_clean` to reflect the contract. - New helper `_assert_post_filter_warning_surface_is_clean()` replaces the old `_assert_exact_matmul_warning_surface` (multiplicity-pin) + `_EXPECTED_MATMUL_MESSAGES` constant. - `_silence_spillover_matmul_warnings()` reverts to a single broad `r".*encountered in matmul"` filter (no message-exact-per-flavor list — narrowing wasn't needed once the assertion stopped requiring presence). Docstring now cites TODO.md and the Apple BLAS root cause instead of the misattributed "per-unit FE projection edge case". On Apple Silicon M4 + numpy<2.3: warnings fire, filter catches them, assertion passes. On M3 / Intel / Linux or numpy>=2.3: warnings don't fire, filter is a no-op, assertion still passes. Either way, any unexpected UserWarning / FutureWarning / non-matmul RuntimeWarning fails the guard. Notebook §4 comment updated to attribute the matmul filter to the Apple Silicon BLAS issue (with TODO.md pointer). §5 and §6 inherit the cross-reference via "See §4 for the rationale ...". CHANGELOG entry tightened: drops the (now incorrect) "multiplicity shift surfaces immediately" claim and explains the platform-agnostic post-filter contract instead. Local: 17 drift tests + nbmake still pass on Apple Silicon (warnings fire and get filtered). CI Linux: the filter is a no-op and the assertion still passes because the SpilloverDiD fit emits no warnings when the BLAS bug isn't triggered. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…≥100 grid
CI codex R2 noted the §4 narrative claim that `d_bar in {100, 150,
200}` produces identical estimates covers BOTH `tau_total` AND
`delta_1`, but the exact-identity drift guard
`test_rings_grid_d_bar_100_to_200_identical` only pinned `tau_total`.
`delta_1` was checked via the round-to-1 grid endpoints
(`test_rings_sensitivity_grid_endpoints`), which would still pass if
a future change introduced a small (<0.05) delta_1 drift between the
three d_bar values while keeping the rounded endpoints unchanged.
Adds companion test `test_rings_grid_d_bar_100_to_200_identical_delta_1`
that asserts `delta_1` is bit-equal (atol=1e-10) across d_bar ∈ {100,
150, 200} on the locked panel, mirroring the tau_total identity test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cosmetic P3s from CI codex R3: - docs/tutorials/README.md: catalog stopped at T22; added a T23 entry in the same format (overview + companion drift-test pointer). - CHANGELOG.md: said "17 function-level tests" — file now has 18 after the d_bar≥100 delta_1 identity companion test landed in commit 1248c44. Bumped to 18 and also surfaced the new exact- identity claim ("tau_total AND delta_1 ... per Butts §4 'once d_bar covers the true horizon, widening is benign'"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI codex R4 caught two real gaps in the drift coverage: P2 (Maintainability): the test file's docstring claims the DGP-builder is duplicated "verbatim" from the notebook §2 cell, but only the parameter CONSTANTS are pinned (`test_dgp_true_parameters_match_quoted`). Non-constant edits (coordinate ranges, lambda_t, row construction) could drift silently if the headline numbers stay within tolerance. Adds `test_notebook_dgp_ast_matches_test_fixture` — parses the notebook JSON, extracts the §2 `build_t23_panel` FunctionDef, and compares its AST (with docstring stripped, function name normalized) against `_build_t23_panel`'s. Uses `ast.dump` for whitespace- and comment-agnostic semantic equality. Any DGP-logic divergence between the two copies now fails loudly; cosmetic-only edits (whitespace, comments) don't trigger spurious failures. P3 (Documentation/Tests): §2 quotes seed-specific geometry numbers (max ~12 km, cluster diameter ~22 km, near 12-82 km, far 224-331 km) and §6 quotes pair-support percentages (far×far ~95% within 100 km, near×near 100%). Drift tests only pinned the band counts and lat bounds, so those prose details could drift silently. Adds `test_seed_specific_geometry_pins_match_quoted` — recomputes each quoted value from the seed-23 panel using haversine-deg-to-km arithmetic and asserts they match the notebook narrative integers (rounded). If a future RNG/geometry change shifts any number outside the rounded value, the test fails and the maintainer must update either the prose or the layout parameters. 20 drift tests pass (16 → 20: +AST sync, +geometry pin, + `test_rings_grid_d_bar_100_to_200_identical_delta_1` from R3, + the R3 §6 warning-policy guard). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three P3 refinements from R5:
1. (Doc) §6 said the far-band "lat extent" is ~131 km. That's wrong:
the latitude span is `[2.0, 3.0]` * 111 km ≈ 107 km of TRUE lat
extent; the 131 km figure is the PAIRWISE BAND DIAMETER (the
`np.sqrt(dlat² + dlon²).max()` quantity the test actually computes).
Renames the prose to "max within-band pairwise distance is ~131 km"
and adds an explanatory comment in the test where the geometry was
computed.
2. (Maintainability) `test_seed_specific_geometry_pins_match_quoted`
claimed it pinned "all the values quoted in the notebook", but two
§6 values were unpinned: "treated × near-control pairs ≤ ~94 km
separation" and "median far/far pairwise distance ~56 km". Also
the §6 prose claimed ≤ ~94 km but the actual max treated × near
distance at seed 23 is ~89.7 km, so I updated the prose to
"≤ ~90 km" to match (more accurate) and added two new helper
functions + assertions:
- `_cross_band_max_pair_km(treated, near) == 90`
- `_within_band_median_pair_km(far) == 56`
3. (Doc) `CHANGELOG.md` understated coverage at 18 tests; the file
now defines 20 (R4 added AST sync + geometry pins). Bumped to 20.
Drift suite: still 20 tests; numerical content unchanged. Notebook
prose: §6 wording reflects geometric reality.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R6 noted that my AST sync test only compared the `build_t23_panel` function body, but the notebook §2 cell also has 10 cell-level constant assignments above the function (MAIN_SEED, N_TREATED, N_NEAR, N_FAR, T_PERIODS, FIRST_TREAT, TAU_TOTAL, DELTA_1, D_BAR_KM, NOISE_SD). A notebook-only edit to any of those would change tutorial behavior without failing either `test_notebook_dgp_ast_matches_test_fixture` (only compares function body) or `test_dgp_true_parameters_match_quoted` (only reasserts the test module's own constants). Adds `test_notebook_dgp_constants_match_test_module_constants`: parses the notebook §2 cell, walks the top-level `ast.Assign` nodes, literal-evaluates each RHS, and asserts the 10 expected constants have values matching the test module. Any notebook-only constant drift now fails the guard with a (notebook_value, test_value) diff in the error message. CHANGELOG count bumped 20 → 21. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R7 noted that the notebook, README, and CHANGELOG publish 2-decimal
numbers (-4.29, -7.34, -4.53, -5.38, -2.59) but the drift tests only
enforce round-to-1 precision. Future drift smaller than 0.05 could
leave the prose stale without failing any test.
Mixed fix:
- WELL-CONDITIONED headline pins tightened to round-to-2 (matching
what the prose actually quotes):
* test_naive_att_endpoint_matches_quoted: -4.3 → -4.29
* test_spillover_did_recovers_tau_total: -7.3 → -7.34
* test_spillover_did_recovers_delta_1: -4.5 → -4.53
- BORDERLINE rings=[0,50] grid point left at round-to-1 (per the R5
reviewer's BLAS-safety guidance). Notebook §4 prose coarsened to
match: "-5.38" → "~-5.4" and "-2.59" → "~-2.6", with an explicit
parenthetical explaining WHY this point uses 1-decimal precision
while §5 uses 2 (borderline-rank-deficient design under d_bar=50
can shift across BLAS paths).
Headline values are stable across BLAS paths to better than 0.005
(verified on Apple Silicon; cross-platform variance on well-
conditioned fits is sub-ULP). The drift-test docstrings now
explicitly document the 2-decimal vs 1-decimal contract for future
maintainers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cision
R8 caught two P3 prose accuracy issues:
1. §6 Conley narrative said the no-survey path "operates on raw
scores" and the meat is "HC1 diagonal terms PLUS off-diagonal
cross-products". Both phrasings misstate the documented Wave D
path:
- The no-survey Conley path uses observation-level Wave D Gardner
GMM-corrected influence rows `psi_i` (per `diff_diff/two_stage.py:L83-L120`
and `docs/methodology/REGISTRY.md:L3341-L3363`), not raw scores.
- The Conley meat decomposition is diagonal self-product terms
(`psi_i psi_i'`) plus off-diagonal Bartlett-weighted pair terms
(`K(d_{ij}/h) psi_i psi_j'`). It does NOT apply an HC1 `n/(n-p)`
finite-sample multiplier — that lives on the HC1 path only.
Rewrites the §6 mechanism paragraph + cross-band paragraph to:
- Refer to "observation-level Wave D Gardner GMM-corrected
influence rows $\psi_i$" instead of "raw scores".
- Use the explicit decomposition $\psi_i \psi_i'$ + Bartlett-
weighted $\psi_i \psi_j'$ and add a parenthetical noting the
missing HC1 multiplier as a separate contributor to the
Conley < HC1 direction on this DGP.
2. CHANGELOG said "naive coefficient endpoint (round-to-1)" but the
actual drift test now pins -4.29 at 2 decimals. Updates the
CHANGELOG line to reflect the current contract — naive coefficient
at 2 decimals, headline tau/delta_1 at 2 decimals, borderline
`rings=[0,50]` at 1 decimal, and lists the additional sync /
geometry / constant pins added during the R4-R8 iteration cycle.
No code/test changes; prose-only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…attribution
R9 noted that README + CHANGELOG over-attributed the §6 demo
configuration `vcov_type="conley", conley_cutoff_km=100,
conley_lag_cutoff∈{0,1}` to "Butts §3.1". The paper §3.1 only
supports the spatial cutoff choice (`conley_cutoff_km = d_bar`); the
`conley_lag_cutoff` serial-Bartlett extension is the library's
documented Wave E.2 follow-up synthesis (Newey-West 1987 form on the
within-PSU/within-unit time axis, per REGISTRY "Variance (Wave E.2
follow-up)"). The notebook body §6 itself already drew this
distinction correctly; only the condensed summaries were imprecise.
Rephrases both summary sentences to:
"the cutoff = `d_bar` choice follows Butts §3.1, while the
`conley_lag_cutoff` serial extension is the library's documented
Wave E.2 follow-up synthesis with Newey-West-style serial Bartlett
HAC (per REGISTRY 'Variance (Wave E.2 follow-up)')".
No code/test changes; documentation attribution-only refinement.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…claim
R10 noted that CHANGELOG attributed the `d_bar ∈ {100, 150, 200}`
exact identity to "Butts §4 'once d_bar covers the true horizon,
widening is benign'", but that overstates the paper. The Butts paper
and the registry frame `d_bar` as a real bias/variance tradeoff; the
exact identity at THIS grid is a DGP-specific consequence of the
synthetic panel having no units in the 80-200 km band (the
near-control band tops out at ~78 km, and the far-control band starts
at ~224 km, so widening past 100 km adds zero observations to any
ring bin).
Rephrases the CHANGELOG entry to attribute the identity to the
tutorial DGP geometry and explicitly note that it is NOT a generic
result.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R11 noted that §7's survey-weights bullet conflated Wave E.2 (cross-sectional Conley + survey via stratified-Conley sandwich on PSU totals) with the Wave E.2 follow-up (panel-block spatial + serial Bartlett HAC; requires an effective PSU on the survey design — either explicit `survey_design.psu=` or `cluster=<col>` injected per the Wave E.1 warn-and-use-PSU pattern). Rewrites the bullet to distinguish: - `conley_lag_cutoff=0`: Wave E.2 stratified-Conley sandwich on PSU totals (cross-sectional, spatial only) - `conley_lag_cutoff>0`: Wave E.2 follow-up panel-block extension (library synthesis with within-PSU serial Bartlett HAC, requires effective PSU on the survey design) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rce absent CI's pure-Python and Rust matrix jobs copy `tests/` to an isolated location (per the "Copy tests to isolated location (Unix)" workflow step) WITHOUT the `docs/` tree, to verify the installed package doesn't depend on the source tree. Two of my new drift tests need to load `docs/tutorials/23_spillover_tva.ipynb` and fail with `FileNotFoundError: '/private/tmp/docs/tutorials/23_spillover_tva.ipynb'` on every non-source-tree job. Failing jobs (from run 26416449494): - Python Tests (macos-latest, py3.11/3.13/3.14) - Python Tests (windows-latest, py3.13) - Python Tests (ubuntu-24.04-arm, py3.14) Fix: both `test_notebook_dgp_constants_match_test_module_constants` and `test_notebook_dgp_ast_matches_test_fixture` now `pytest.skip()` gracefully when the notebook source isn't reachable, with a docstring note explaining that the sync-guard contract is meaningful in local dev (where edits happen) and the nbmake CI job separately verifies the notebook executes end-to-end. This preserves the codex R4/R6 P2 fixes (notebook source IS available in local dev runs and in the iterative codex-review workflow, so the guards still trigger on real edits) while making the suite robust to the project's isolated-tests CI pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
239eeb5 to
c3fd1de
Compare
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment✅ Looks good Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
Summary
docs/tutorials/23_spillover_tva.ipynb(slot 23 — slot 22 occupied by22_had_survey_design.ipynb) walks a practitioner throughSpilloverDiDon a synthetic TVA-style spillover panel reproducing the Butts (2021) §4 Table 1 Panel A ~40% understatement direction. 7 sections / ~11 markdown + 6 code cells, capped with a Conley spatial-HAC variance demo that showcases the recent Spillover-Conley initiative (PRs SpilloverDiD survey_design= on HC1/CR1 via Binder TSL (Wave E.1) #468 / SpilloverDiD vcov_type='conley' + survey_design= via panel-aware stratified-Conley sandwich on per-period PSU totals (Wave E.2) #474 / SpilloverDiD conley + survey + lag>0 via panel-block composition (Wave E.2 follow-up) #477 / SpilloverDiD Wave E.3: SurveyDesign.subpopulation() full-design retention #482 / TwoStageDiD Wave E.3 parity: always-treated full-design retention #485 / spillover-conley: dedup serial Bartlett kernel + PSD guard helpers #489).tau_total=-7.4,delta_1=-4.5,d_bar=100 km. Naive multi-period TWFE (with explicitabsorb=["unit"]) lands at -4.29 — exactly 42% understatement of the true direct effect. SpilloverDiD withrings=[0.0, 100.0]cleanly recovers tau_total (-7.34) and delta_1 (-4.53).tests/test_t23_spillover_tva_drift.py(17 function-level tests mirroring T19 pattern) pin: panel composition, geographic bands, DGP-parameter constants (must stay in sync with the notebook §2 cell),SpilloverDiD.get_params()schema, naive-coef endpoint (round-to-1),tau_total/delta_1recovery,ringssensitivity grid (round-to-1 per reviewer guidance for BLAS safety on borderline-rank-deficientrings=[0,50]), d_bar≥100km estimate-identity invariance, Conley<HC1 direction pin, point-estimate invariance across vcov types, and the EXACT warning surface (Counter-multiset equality on the three known benign*encountered in matmulRuntimeWarnings).Methodology references
SpilloverDiD(Butts 2021 ring-indicator estimator + Gardner 2022 two-stage residualize-then-fit + Wave A/E.2 Conley spatial-HAC composition).docs/methodology/papers/butts-2021-review.md:257endorses the ~40% bias-correction direction as a tutorial-friendly DGP target).docs/references.rstin this PR.Validation
tests/test_t23_spillover_tva_drift.py(17 function-level tests, T19 pattern).docs/tutorials/23_spillover_tva.ipynbexecutes end-to-end underpytest --nbmakein ~2 seconds. The DGP-builder is duplicated verbatim into the drift-testpanelfixture per the T19 pattern (no_scratch/shared module —_scratch/is gitignored); thetest_dgp_true_parameters_match_quotedpin catches silent drift between the two copies.-W warnings as errorspasses; the toctree entry under Tutorials: Business Applications (aftertutorials/22_had_survey_design) is wired correctly.Counter-multiset equality, R9 P3 seed-23 geometry numbers + CHANGELOG count).docs/index.rsttoctree,docs/methodology/papers/butts-2021-review.md:257(T22→T23),docs/references.rst(Kline-Moretti 2014 added),TODO.md(drops the T22-tutorial row),CHANGELOG.md(one-liner under[Unreleased] ### Added).Security / privacy
Generated with Claude Code